Module 1

Introduction to machine learning

Game Plan

This module provides the basis for the rest of the course by introducing the basic concepts behind machine learning, and, specifically, how to perform machine learning by using RStudio and the CARET R package. First, you will learn how machine learning and artificial intelligence are disrupting businesses. Next, you will learn about the basic types of machine learning and how to leverage these algorithms in a R script. Third, you will learn how linear regression can be considered a machine learning problem with parameters that must be determined computationally by minimizing a cost function. Finally, you will learn about neighbor-based algorithms, including the k-nearest neighbor algorithm, which can be used for both classification and regression tasks.

Objectives

By the end of this module, you should be able to:

Artificial Intelligence and Accountancy

This lesson explores the fundamentals of machine learning and artificial intelligence and how these tools are being used in accountancy and business in general.

Interesting Articles

How is accountancy and finance world using artificial intelligence

Financial Statement Audits

  1. What is the growing role artificial intelligence (and by association, machine learning) play in modern accounting?

  2. What does the impact technology plays in shaping careers in modern accountancy?

  3. How is artificial intelligence impacting financial auditing?

Introduction to Machine Learning1

Your most important skill will be your ability to translate data into insights that are clear and meaningful to a stakeholder.

The Four Types of Data Analysis are:

1. Descriptive Analytics: What is happening?

This is the most common of all forms. In business, it provides the analyst with a view of key metrics and measures within the company.

An example of this could be a monthly profit and loss statement. Similarly, an analyst could have data on a large population of customers. Understanding demographic information on their customers (e.g. 30% of our customers are self-employed) would be categorised as “descriptive analytics”. Utilising useful visualisation tools enhances the message of descriptive analytics.

2. Diagnostic Analytics: Why is it happening?

This is the next step in complexity in data analytics is descriptive analytics. On the assessment of the descriptive data, diagnostic analytical tools will empower an analyst to drill down and in so doing isolate the root-cause of a problem.

Well-designed business information (BI) dashboards incorporating reading of time-series data (i.e. data over multiple successive points in time) and featuring filters and drill down capability allow for such analysis.

3. Predictive Analytics: What is likely to happen?

Predictive analytics is all about forecasting. Whether it’s the likelihood of an event happening in future, forecasting a quantifiable amount or estimating a point in time at which something might happen - these are all done through predictive models.

Predictive models typically utilise a variety of variable data to make the prediction. The variability of the component data will have a relationship with what it is likely to predict (e.g. the older a person, the more susceptible they are to a heart-attack – we would say that age has a linear correlation with heart-attack risk). These data are then compiled together into a score or prediction.

In a world of significant uncertainty, being able to predict allows one to make better decisions. Predictive models are some of the most important utilised across many fields.

Common Issues with Prediction

4. Prescriptive Analytics: What do I need to do?

The next step up regarding value and complexity is the prescriptive model.  The prescriptive model utilises an understanding of what has happened, why it has happened and a variety of “what-might-happen” analysis to help the user determine the best course of action to take. A prescriptive analysis is typically not just with one individual response but is, in fact, a host of other actions.

An excellent example of this is a traffic application helping you choose the best route home and taking into account the distance of each route, the speed at which one can travel on each road and, crucially, the current traffic constraints.

Another example might be producing an exam time-table such that no students have clashing schedules.

Getting our hands dirty

As a field, machine learning is both expansive and mathematically complex. From deriving simple linear relationships via regression analysis to finding clusters of data points in an N-dimensional space; statistical and machine learning techniques can take years to fully master. However, given the short time available in this course, we will take the simpler approach of demonstrating several commonly used approaches in order to both introduce the fundamental concepts in machine learning and the methodology we will use in RStudio to apply these concepts to actual data. For the latter, we will use the standard machine learning library in R, which is the Caret Package.

We will demonstrate the four main tasks of machine learning: classification, regression, dimensional reduction, and clustering. Note that this module is simply an introduction to these topics, we will explore these and other areas in more detail throughout this course. Finally, we will discuss how to persist machine learning models.

Set up

Make sure you have the caret package installed.

install.packages("caret")

Run your libraries

The first steps in any data analytics effort, once the business goal has been defined, are to understand and prepare the data of interest. For example, the Cross Industry Standard Process for Data Mining or (CRISP-DM) starts with the Business Understanding step, immediately followed by the Data Understanding and Data Preparation steps. For machine learning analyses, these latter two steps require loading the data into our notebook, exploring the data either systematically or in a cumulative sense to understand the typical features for different instances. We also can generate descriptive statistical summaries and visualizations, such as a pair plot, to understand the data in full. Finally, we will need to clean the data to properly account for missing data, data that are incomplete or formatted incorrectly, or to generate meta-features (such as a date-time) from existing features.

For this module, we will focus on a single, simple data set, the standard Iris dataset, which is included by default. Note that given a data set, such as the Iris data, we have rows, which correspond to different instances (e.g., different flowers), and columns, which correspond to different features of the instances (e.g., different measurements of the flowers). To understand the data, we first load this data into RStudio, before looking at several instances from the data. Next, we will group the data by species to explore cumulative quantities, before extracting a statistical summary of the entire data set. Finally, we will generate a pair plot to visually explore the data. Since this data has already been cleaned (and only consists of four features) we will not need to perform additional tasks.

Load the data

#load the data
iris<-iris

The data set consists of 150 total measurements of three different types of Iris flowers, equally divided between three classes: Iris setosa, Iris versicolor, and Iris virginica. Before proceeding, we can examine the DataFrame that contains these data to view typical instances, to see a cumulative summary, and a brief statistical summary.

#examine the top 5 rows
head(iris,5)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
#view the whole dataset
knitr::kable(iris)%>%
  kableExtra::kable_styling("striped")%>%
  kableExtra::scroll_box(width = "100%",height="300px")
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5.0 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3.0 1.4 0.1 setosa
4.3 3.0 1.1 0.1 setosa
5.8 4.0 1.2 0.2 setosa
5.7 4.4 1.5 0.4 setosa
5.4 3.9 1.3 0.4 setosa
5.1 3.5 1.4 0.3 setosa
5.7 3.8 1.7 0.3 setosa
5.1 3.8 1.5 0.3 setosa
5.4 3.4 1.7 0.2 setosa
5.1 3.7 1.5 0.4 setosa
4.6 3.6 1.0 0.2 setosa
5.1 3.3 1.7 0.5 setosa
4.8 3.4 1.9 0.2 setosa
5.0 3.0 1.6 0.2 setosa
5.0 3.4 1.6 0.4 setosa
5.2 3.5 1.5 0.2 setosa
5.2 3.4 1.4 0.2 setosa
4.7 3.2 1.6 0.2 setosa
4.8 3.1 1.6 0.2 setosa
5.4 3.4 1.5 0.4 setosa
5.2 4.1 1.5 0.1 setosa
5.5 4.2 1.4 0.2 setosa
4.9 3.1 1.5 0.2 setosa
5.0 3.2 1.2 0.2 setosa
5.5 3.5 1.3 0.2 setosa
4.9 3.6 1.4 0.1 setosa
4.4 3.0 1.3 0.2 setosa
5.1 3.4 1.5 0.2 setosa
5.0 3.5 1.3 0.3 setosa
4.5 2.3 1.3 0.3 setosa
4.4 3.2 1.3 0.2 setosa
5.0 3.5 1.6 0.6 setosa
5.1 3.8 1.9 0.4 setosa
4.8 3.0 1.4 0.3 setosa
5.1 3.8 1.6 0.2 setosa
4.6 3.2 1.4 0.2 setosa
5.3 3.7 1.5 0.2 setosa
5.0 3.3 1.4 0.2 setosa
7.0 3.2 4.7 1.4 versicolor
6.4 3.2 4.5 1.5 versicolor
6.9 3.1 4.9 1.5 versicolor
5.5 2.3 4.0 1.3 versicolor
6.5 2.8 4.6 1.5 versicolor
5.7 2.8 4.5 1.3 versicolor
6.3 3.3 4.7 1.6 versicolor
4.9 2.4 3.3 1.0 versicolor
6.6 2.9 4.6 1.3 versicolor
5.2 2.7 3.9 1.4 versicolor
5.0 2.0 3.5 1.0 versicolor
5.9 3.0 4.2 1.5 versicolor
6.0 2.2 4.0 1.0 versicolor
6.1 2.9 4.7 1.4 versicolor
5.6 2.9 3.6 1.3 versicolor
6.7 3.1 4.4 1.4 versicolor
5.6 3.0 4.5 1.5 versicolor
5.8 2.7 4.1 1.0 versicolor
6.2 2.2 4.5 1.5 versicolor
5.6 2.5 3.9 1.1 versicolor
5.9 3.2 4.8 1.8 versicolor
6.1 2.8 4.0 1.3 versicolor
6.3 2.5 4.9 1.5 versicolor
6.1 2.8 4.7 1.2 versicolor
6.4 2.9 4.3 1.3 versicolor
6.6 3.0 4.4 1.4 versicolor
6.8 2.8 4.8 1.4 versicolor
6.7 3.0 5.0 1.7 versicolor
6.0 2.9 4.5 1.5 versicolor
5.7 2.6 3.5 1.0 versicolor
5.5 2.4 3.8 1.1 versicolor
5.5 2.4 3.7 1.0 versicolor
5.8 2.7 3.9 1.2 versicolor
6.0 2.7 5.1 1.6 versicolor
5.4 3.0 4.5 1.5 versicolor
6.0 3.4 4.5 1.6 versicolor
6.7 3.1 4.7 1.5 versicolor
6.3 2.3 4.4 1.3 versicolor
5.6 3.0 4.1 1.3 versicolor
5.5 2.5 4.0 1.3 versicolor
5.5 2.6 4.4 1.2 versicolor
6.1 3.0 4.6 1.4 versicolor
5.8 2.6 4.0 1.2 versicolor
5.0 2.3 3.3 1.0 versicolor
5.6 2.7 4.2 1.3 versicolor
5.7 3.0 4.2 1.2 versicolor
5.7 2.9 4.2 1.3 versicolor
6.2 2.9 4.3 1.3 versicolor
5.1 2.5 3.0 1.1 versicolor
5.7 2.8 4.1 1.3 versicolor
6.3 3.3 6.0 2.5 virginica
5.8 2.7 5.1 1.9 virginica
7.1 3.0 5.9 2.1 virginica
6.3 2.9 5.6 1.8 virginica
6.5 3.0 5.8 2.2 virginica
7.6 3.0 6.6 2.1 virginica
4.9 2.5 4.5 1.7 virginica
7.3 2.9 6.3 1.8 virginica
6.7 2.5 5.8 1.8 virginica
7.2 3.6 6.1 2.5 virginica
6.5 3.2 5.1 2.0 virginica
6.4 2.7 5.3 1.9 virginica
6.8 3.0 5.5 2.1 virginica
5.7 2.5 5.0 2.0 virginica
5.8 2.8 5.1 2.4 virginica
6.4 3.2 5.3 2.3 virginica
6.5 3.0 5.5 1.8 virginica
7.7 3.8 6.7 2.2 virginica
7.7 2.6 6.9 2.3 virginica
6.0 2.2 5.0 1.5 virginica
6.9 3.2 5.7 2.3 virginica
5.6 2.8 4.9 2.0 virginica
7.7 2.8 6.7 2.0 virginica
6.3 2.7 4.9 1.8 virginica
6.7 3.3 5.7 2.1 virginica
7.2 3.2 6.0 1.8 virginica
6.2 2.8 4.8 1.8 virginica
6.1 3.0 4.9 1.8 virginica
6.4 2.8 5.6 2.1 virginica
7.2 3.0 5.8 1.6 virginica
7.4 2.8 6.1 1.9 virginica
7.9 3.8 6.4 2.0 virginica
6.4 2.8 5.6 2.2 virginica
6.3 2.8 5.1 1.5 virginica
6.1 2.6 5.6 1.4 virginica
7.7 3.0 6.1 2.3 virginica
6.3 3.4 5.6 2.4 virginica
6.4 3.1 5.5 1.8 virginica
6.0 3.0 4.8 1.8 virginica
6.9 3.1 5.4 2.1 virginica
6.7 3.1 5.6 2.4 virginica
6.9 3.1 5.1 2.3 virginica
5.8 2.7 5.1 1.9 virginica
6.8 3.2 5.9 2.3 virginica
6.7 3.3 5.7 2.5 virginica
6.7 3.0 5.2 2.3 virginica
6.3 2.5 5.0 1.9 virginica
6.5 3.0 5.2 2.0 virginica
6.2 3.4 5.4 2.3 virginica
5.9 3.0 5.1 1.8 virginica
#examine grouped data
iris%>%
  group_by(Species)%>%
  summarise(count=n())
# A tibble: 3 x 2
  Species    count
  <fct>      <int>
1 setosa        50
2 versicolor    50
3 virginica     50
# Get descriptive statistics
summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                

Another handy package

install.packages("psych")
psych::describe(iris)
             vars   n mean   sd median trimmed  mad min max range
Sepal.Length    1 150 5.84 0.83   5.80    5.81 1.04 4.3 7.9   3.6
Sepal.Width     2 150 3.06 0.44   3.00    3.04 0.44 2.0 4.4   2.4
Petal.Length    3 150 3.76 1.77   4.35    3.76 1.85 1.0 6.9   5.9
Petal.Width     4 150 1.20 0.76   1.30    1.18 1.04 0.1 2.5   2.4
Species*        5 150 2.00 0.82   2.00    2.00 1.48 1.0 3.0   2.0
              skew kurtosis   se
Sepal.Length  0.31    -0.61 0.07
Sepal.Width   0.31     0.14 0.04
Petal.Length -0.27    -1.42 0.14
Petal.Width  -0.10    -1.36 0.06
Species*      0.00    -1.52 0.07
# ?psych::describe

Look up what mad, se, skewness, and kurtosis is… 🕵️


As demonstrated by the output from the previous code cells, our test data matches our expectations (note that the full Iris data set is listed on Wikipedia). These data consist of three types, each with fifty instances, and every row has four measured features (i.e., attributes). The four primary features of the data are Sepal Length, Sepal Width, Petal Length, and Petal Width. In simple terms, petals are the showy, colorful part of the Iris flower, while the sepals provide protection and support for the petals.

In addition, our cursory exploration of the DataFrame indicated the data are clean. One simple way to verify this is that the count is the same for every feature, and the descriptive statistics (e.g., min, max, and mean) are all numerical. If we had missing or bad data in our DataFrame, these measures would generally indicate the problem. If there were missing data, we could drop any instance with missing data by using the na.omit method, or alternatively insert a value by using mutate . An alternative, and powerful, technique for handling missing data is known as imputing, where we apply machine learning2 to generate realistic values for any missing data. This approach will be demonstrated in a subsequent module.

At this point, we have loaded our data, and verified the data are clean. The next step is to visualize the relationships between the different features in our data.

Lets use another package 😆

install.packages("GGally")

Look at me 👀 ggpairs function 👀

GGally::ggpairs(iris)

GGally::ggpairs(iris, aes(color = Species, alpha = 0.5))

These figures indicate that the three Iris species cluster naturally in these dimensions, with minimal overlap. As a result, these data provide an excellent test for different machine learning algorithms.

First, however, we will generate one scatter plot that displays a larger version of the Sepal Width versus Petal Width scatter plot to highlight the inherent structure in these data. Furthermore, we will refer back to this plot in later analyses in this Module.

#crash course in plots?
ggplot(data=iris,mapping = aes(x=Sepal.Width,y=Petal.Width,color=Species))+geom_point(alpha=0.5)

Side Note…Other plots 🙈

Line Graph

#line graph..this is a TERRIBLE example...
#ask me why
iris.setosa<-iris%>%
  filter(Species=="setosa")%>%
  distinct(Sepal.Width,.keep_all = TRUE)

ggplot(data=iris.setosa,mapping = aes(x=Sepal.Width,y=Sepal.Length))+geom_line()

Histograms

ggplot(data=iris,mapping = aes(x=Sepal.Length,color=Species))+geom_histogram()

Boxplots

ggplot(data=iris,mapping = aes(x=Species,y=Sepal.Length))+geom_boxplot()

Bar plots

ggplot(data=iris,mapping = aes(x=Species))+geom_bar()

Introducing Machine Learning

Machine learning algorithms can be classified by the method in which they are constructed. Supervised learning methods use training data to build a model, which is subsequently applied to additional data. On the other hand, unsupervised learning methods seek relationships among data points that can be leveraged to construct a model that is subsequently applied to the data of interest. In some cases, training data are used to validate the effectiveness of an unsupervised method, or perhaps to provide some level of supervision, which is known as semi-supervised learning.

More recently, additional types of learning have been developed. First, transfer learning extends a model trained on a previous data set to new, related data. This can be viewed as learning by analogy, which is similar to how humans learn. Second, reinforcement learning is a technique that explores how agents should behave within an environment by maximizing a cumulative reward. Finally, deep learning applies artificial neural networks that have multiple hidden layers to complex tasks, often with spectacular success in areas from image recognition to natural language processing.

Broadly speaking, the application of a machine learning algorithm will be one of four different categories:

  1. Classification: generates a model that predicts discrete categories for new, unseen data.

  2. Regression: generates a model that predicts continuous values for new, unseen data.

  3. Dimensionality reduction: identifies (and optionally ranks) the most important (potentially new) features (or dimensions) for a data set.

  4. Clustering: identifies clusters of instances in an N-dimensional feature space.

One final point to clarify before proceeding with demonstrations of these different algorithm categories. When applying a machine learning algorithm to a problem, we often need to specify both model parameters and model hyperparameters. While they are similar, the difference between these two types of information depends on whether the value can be estimated from the data.

Parameter

A value that can be estimated from the data being analyzed and that is internal to the machine learning algorithm. A parameter is generally not specified by the programmer, and instead is determined automatically by the algorithm implementation (e.g., directly in the caret package). For example, the coefficients in a linear regression model are machine learning parameters.

Hyperparameter

A value that cannot be estimated from the data being analyzed and that is external to a specific machine learning algorithm. A hyperparameter is generally specified by the programmer prior to the start of the learning process. As a result, the hyperparameter directly influences the performance of the algorithm and thus is a tunable parameter. For example, the number of neighbors in a k-nearest neighbors implementation is a hyperparameter.

Introducing Caret3

Caret stands for Classification And Regression Training. Apparently caret has little to do with our orange friend, the carrot. 🥕

Not only does caret allow you to run a plethora of ML methods, it also provides tools for auxiliary techniques such as:

An extensive vignette for caret can be found here: https://topepo.github.io/caret/index.html

Data Pre-Processing

Before we can apply a machine learning algorithm to the data of interest, we must divide the data into training and testing data sets. The training data are used to generate the supervised model, while the testing data are used to quantify the quality of the generated model. The function createDataPartition can be used to create balanced splits of the data. If the y argument to this function is a factor, the random sampling occurs within each class and should preserve the overall class distribution of the data. For example, to create a single 60/40% split of the iris data:

#lets split the data 60/40
library(caret)
trainIndex <- createDataPartition(iris$Species, p = .6, list = FALSE, times = 1)

#look at the first few
head(trainIndex)
     Resample1
[1,]         1
[2,]         2
[3,]         3
[4,]         4
[5,]         5
[6,]         7
#grab the data
irisTrain <- iris[ trainIndex,]
irisTest  <- iris[-trainIndex,]

Data Scaling

Many machine learning estimators are sensitive to variations in the spread of features within a data set. For example, if all features but one span similar ranges (e.g., zero through one) and one feature spans a much larger range (e.g., zero through one hundred), an algorithm might focus on the one feature with a larger spread, even if this produces a sub-optimal result. To prevent this, we generally scale the features to improve the performance of a given estimator.

Data scaling can take several forms:

One important caveat to scaling is that any scaling technique should be trained via the fit method on the training data used for the machine learning algorithm. Once trained, the scaling technique can be applied equally to the training and testing data. In this manner, the testing data will always match the space spanned by the training data, which is what is used to generate the predictive model.

We demonstrate this approach in the following code cell, where we compute a standardization from our training data. This transformation is applied to both the training and testing data.

preProcValues <- preProcess(irisTrain, method = c("center", "scale"))

trainTransformed <- predict(preProcValues, irisTrain)
testTransformed <- predict(preProcValues, irisTest)

I made a mistake here…Can you spot it 👀

preProcValues <- preProcess(irisTest, method = c("center", "scale"))
testTransformed <- predict(preProcValues, irisTest)
psych::describe(trainTransformed)
             vars  n mean   sd median trimmed  mad   min  max range
Sepal.Length    1 90    0 1.00  -0.03   -0.04 1.07 -1.84 2.50  4.34
Sepal.Width     2 90    0 1.00  -0.07   -0.04 0.71 -1.97 2.79  4.76
Petal.Length    3 90    0 1.00   0.42    0.01 0.92 -1.50 1.77  3.27
Petal.Width     4 90    0 1.00   0.25   -0.02 1.14 -1.42 1.66  3.08
Species*        5 90    2 0.82   2.00    2.00 1.48  1.00 3.00  2.00
              skew kurtosis   se
Sepal.Length  0.24    -0.61 0.11
Sepal.Width   0.39     0.16 0.11
Petal.Length -0.34    -1.45 0.11
Petal.Width  -0.11    -1.35 0.11
Species*      0.00    -1.53 0.09
psych::describe(testTransformed)
             vars  n mean   sd median trimmed  mad   min  max range
Sepal.Length    1 60    0 1.00  -0.20   -0.05 1.07 -1.77 2.20  3.97
Sepal.Width     2 60    0 1.00   0.00   -0.01 0.97 -2.40 2.84  5.24
Petal.Length    3 60    0 1.00   0.20   -0.01 1.22 -1.56 1.67  3.23
Petal.Width     4 60    0 1.00   0.15   -0.01 1.40 -1.47 1.76  3.23
Species*        5 60    2 0.82   2.00    2.00 1.48  1.00 3.00  2.00
              skew kurtosis   se
Sepal.Length  0.41    -0.71 0.13
Sepal.Width   0.18     0.03 0.13
Petal.Length -0.15    -1.42 0.13
Petal.Width  -0.08    -1.44 0.13
Species*      0.00    -1.55 0.11

With our data properly divided into training and testing samples, and the features appropriately scaled, we now change to the application of machine learning algorithms

Classification

The first type of algorithm we will demonstrate is classification, where we train an estimator to generate a model for the prediction of discrete labels. The following code cell completes this task by performing k-Nearest Neighbors classification. In this example, we use five nearest neighbors (but this value can be easily adjusted to see how the classification performance changes). As demonstrated in this code example, the standard classification process in caret is to first fit a model to the training data and to subsequently apply this model to predict values for the testing data. We can compute an accuracy measurement for our trained algorithm to compare the predicted and known labels for the testing data.

Since we set the k there is no reason to actually train… 😖

#fit knn
knn_fit<-train(Species~.,
               data=trainTransformed,
               method="knn",
               tuneGrid=data.frame(k=5))

knn_fit
k-Nearest Neighbors 

90 samples
 4 predictor
 3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 90, 90, 90, 90, 90, 90, ... 
Resampling results:

  Accuracy  Kappa    
  0.877429  0.8148539

Tuning parameter 'k' was held constant at a value of 5
#predict on the test set
knn_pred<-predict(knn_fit,testTransformed)

#confusion
confusionMatrix(knn_pred,testTransformed$Species)
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         20          0         0
  versicolor      0         20         0
  virginica       0          0        20

Overall Statistics
                                     
               Accuracy : 1          
                 95% CI : (0.9404, 1)
    No Information Rate : 0.3333     
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            1.0000           1.0000
Specificity                 1.0000            1.0000           1.0000
Pos Pred Value              1.0000            1.0000           1.0000
Neg Pred Value              1.0000            1.0000           1.0000
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3333           0.3333
Detection Prevalence        0.3333            0.3333           0.3333
Balanced Accuracy           1.0000            1.0000           1.0000

Regression

The second machine learning application we will demonstrate is regression. To demonstrate regression, we will introduce the Decision Tree. A decision tree simply asks a set of questions of the data, and based on the answers, constructs a model representation. The tree (or model) is constructed by recursively splitting a data set into new groupings based on a statistical measure of the data along each different dimension (popular measures include the Gini coefficient or the entropy).

The terminal nodes in the tree are known as leaf nodes, and they provide the final predictions. In the simplest form, the leaf node simply provides the final prediction. More realistic decision trees generate a model prediction by using all instances in the leaf node, for example by averaging across them.

Before generating a regression model, however, we must pre-process our data to identify our independent variables (or features) and our dependent variable (or feature). Given a set of new independent variables, a regression model will predict the dependent variable. In the following code cell, we first select the first three features to be our independent variables and the fourth variable to be our dependent variable. We divide these into training and testing samples.

#fit Decision Tree
DT_fit1<-train(Petal.Width~Sepal.Length+Sepal.Width+Petal.Length,
               data=trainTransformed,
               method="rpart")

DT_fit1
CART 

90 samples
 3 predictor

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 90, 90, 90, 90, 90, 90, ... 
Resampling results across tuning parameters:

  cp          RMSE       Rsquared   MAE      
  0.01978007  0.3518719  0.8755446  0.2591553
  0.11170251  0.4219235  0.8169280  0.3243089
  0.79841276  0.7307899  0.7711550  0.6083211

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.01978007.

Install another package

install.packages("rpart.plot")

We can plot simple trees

rpart.plot::prp(DT_fit1$finalModel,box.palette = "Reds", tweak = 1.2)

Lets predict

#predict on the test set
DTfit1_pred<-predict(DT_fit1,testTransformed)

Rsquared 😈

preds<-DTfit1_pred
actual<-testTransformed$Petal.Width
#one formulation
rss <- sum((preds - actual) ^ 2)  ## residual sum of squares
tss <- sum((actual - mean(actual)) ^ 2)  ## total sum of squares
rsq <- 1 - rss/tss
rsq
[1] 0.9372585
#another
regss <- sum((preds - mean(preds)) ^ 2) ## regression sum of squares
regss / tss
[1] 0.9152572
#another
cor(preds,actual)^2
[1] 0.9373907

Time to blow RSquared up4 💥

R-squared is a statistic that often accompanies regression output. It ranges in value from 0 to 1 and is usually interpreted as summarizing the percent of variation in the response that the regression model explains. So an R-squared of 0.65 might mean that the model explains about 65% of the variation in our dependent variable. Given this logic, we prefer our regression models have a high R-squared.

In R, we typically get R-squared by calling the summary function on a model object. Here’s a quick example using simulated data:

# independent variable
x <- 1:20 
# for reproducibility
set.seed(1) 
# dependent variable; function of x with random error
y <- 2 + 0.5*x + rnorm(20,0,3) 
# simple linear regression
mod <- lm(y~x)
# request just the r-squared value
summary(mod)$r.squared          
[1] 0.6026682

One way to express R-squared is as the sum of squared fitted-value deviations divided by the sum of squared original-value deviations:

\[ R^{2} = \frac{\sum (\hat{y} – \bar{\hat{y}})^{2}}{\sum (y – \bar{y})^{2}} \]

We can calculate it directly using our model object like so:

# extract fitted (or predicted) values from model
f <- mod$fitted.values
# sum of squared fitted-value deviations
mss <- sum((f - mean(f))^2)
# sum of squared original-value deviations
tss <- sum((y - mean(y))^2)
# r-squared
mss/tss                      
[1] 0.6026682

1. R-squared does not measure goodness of fit. It can be arbitrarily low when the model is completely correct. By making\(σ^2\) large, we drive R-squared towards 0, even when every assumption of the simple linear regression model is correct in every particular.

What is \(σ^2\)? When we perform linear regression, we assume our model almost predicts our dependent variable. The difference between “almost” and “exact” is assumed to be a draw from a Normal distribution with mean 0 and some variance we call \(σ^2\).

This statement is easy enough to demonstrate. The way we do it here is to create a function that (1) generates data meeting the assumptions of simple linear regression (independent observations, normally distributed errors with constant variance), (2) fits a simple linear model to the data, and (3) reports the R-squared. Notice the only parameter for sake of simplicity is sigma. We then “apply” this function to a series of increasing \(σ\) values and plot the results.

r2.0 <- function(sig){
  # our predictor
  x <- seq(1,10,length.out = 100)   
  # our response; a function of x plus some random noise
  y <- 2 + 1.2*x + rnorm(100,0,sd = sig) 
  # print the R-squared value
  summary(lm(y ~ x))$r.squared          
}

sigmas <- seq(0.5,20,length.out = 20)
 # apply our function to a series of sigma values
rout <- sapply(sigmas, r2.0)            
plot(rout ~ sigmas, type="b")

R-squared tanks hard with increasing sigma, even though the model is completely correct in every respect.

  1. R-squared can be arbitrarily close to 1 when the model is totally wrong.

The point being made is that R-squared does not measure goodness of fit.

set.seed(1)
# our predictor is data from an exponential distribution
x <- rexp(50,rate=0.005)
# non-linear data generation
y <- (x-1)^2 * runif(50, min=0.8, max=1.2) 
# clearly non-linear
plot(x,y)             

summary(lm(y ~ x))$r.squared
[1] 0.8485146

It’s very high at about 0.85, but the model is completely wrong. Using R-squared to justify the “goodness” of our model in this instance would be a mistake. Hopefully one would plot the data first and recognize that a simple linear regression in this case would be inappropriate.

3. R-squared says nothing about prediction error, even with \(σ^2\) exactly the same, and no change in the coefficients. R-squared can be anywhere between 0 and 1 just by changing the range of X. We’re better off using Mean Square Error (MSE) as a measure of prediction error.

MSE is basically the fitted y values minus the observed y values, squared, then summed, and then divided by the number of observations.

Let’s demonstrate this statement by first generating data that meets all simple linear regression assumptions and then regressing y on x to assess both R-squared and MSE.

x <- seq(1,10,length.out = 100)
set.seed(1)
y <- 2 + 1.2*x + rnorm(100,0,sd = 0.9)
mod1 <- lm(y ~ x)
summary(mod1)$r.squared
[1] 0.9383379
# Mean squared error
sum((fitted(mod1) - y)^2)/100
[1] 0.6468052

Now repeat the above code, but this time with a different range of x. Leave everything else the same:

 # new range of x
x <- seq(1,2,length.out = 100)      
set.seed(1)
y <- 2 + 1.2*x + rnorm(100,0,sd = 0.9)
mod1 <- lm(y ~ x)
summary(mod1)$r.squared
[1] 0.1502448
# Mean squared error
sum((fitted(mod1) - y)^2)/100        
[1] 0.6468052

The R-squared falls from 0.94 to 0.15 but the MSE remains the same. In other words the predictive ability is the same for both data sets, but the R-squared would lead you to believe the first example somehow had a model with more predictive power.

  1. R-squared can easily go down when the model assumptions are better fulfilled.

Let’s examine this by generating data that would benefit from transformation. Notice the R code below is very much like our previous efforts but now we exponentiate our y variable.

x <- seq(1,2,length.out = 100)
set.seed(1)
y <- exp(-2 - 0.09*x + rnorm(100,0,sd = 2.5))
summary(lm(y ~ x))$r.squared
[1] 0.003281718
plot(lm(y ~ x), which=3)

R-squared is very low and our residuals vs. fitted plot reveals outliers and non-constant variance. A common fix for this is to log transform the data. Let’s try that and see what happens:

plot(lm(log(y)~x),which = 3) 

The diagnostic plot looks much better. Our assumption of constant variance appears to be met. But look at the R-squared:

summary(lm(log(y)~x))$r.squared 
[1] 0.0006921086

It’s even lower! This is an extreme case and it doesn’t always happen like this. In fact, a log transformation will usually produce an increase in R-squared. But as just demonstrated, assumptions that are better fulfilled don’t always lead to higher R-squared.

  1. It is very common to say that R-squared is “the fraction of variance explained” by the regression. \[Yet\] if we regressed X on Y, we’d get exactly the same R-squared. This in itself should be enough to show that a high R-squared says nothing about explaining one variable by another.

This is the easiest statement to demonstrate:

x <- seq(1,10,length.out = 100)
y <- 2 + 1.2*x + rnorm(100,0,sd = 2)
summary(lm(y ~ x))$r.squared
[1] 0.737738
summary(lm(x ~ y))$r.squared
[1] 0.737738

Does x explain y, or does y explain x? Are we saying “explain” to dance around the word “cause”? In a simple scenario with two variables such as this, R-squared is simply the square of the correlation between x and y:

all.equal(cor(x,y)^2, summary(lm(x ~ y))$r.squared, summary(lm(y ~ x))$r.squared)
[1] TRUE

Let’s recap:

Dimensionality Reduction

When confronted with a large, multi-dimensional data set, one approach to simplify any subsequent analysis is to reduce the number of dimensions (or features) that must be processed. In some cases, features can be removed from an analysis based on business logic, or the features that contain the most information can be quantified somehow. More generally, however, we can employ dimensional reduction, a machine learning technique that quantifies relationships between the original dimensions (or features, attributes, or columns of a DataFrame) to identify new dimensions that better capture the inherent relationships within the data.

The standard technique to perform this is known as principal component analysis, or PCA. Mathematically, we can derive PCA by using linear algebra to solve a set of linear equations. This process effectively rotates the data into a new set of dimensions, and by ranking the importance of the new dimensions, we can optimally select fewer dimensions for use in other machine learning algorithms.

The PCA estimator requires one tunable hyper-parameter that specifies the target number of dimensions. This value can be arbitrarily selected, perhaps based on prior information, or it can be iteratively determined. After the model is created, we fit the model to the data and next create our new, rotated data set. This is demonstrated in the next code cell.

library(caret)
#store our data in another object
dat <- iris
#take the 4 continuous variables and perform PCA
caret.pca <- preProcess(dat[,-5], method="pca",pcaComp=2)

caret.pca
Created from 150 samples and 4 variables

Pre-processing:
  - centered (4)
  - ignored (0)
  - principal component signal extraction (4)
  - scaled (4)

PCA used 2 components as specified
caret.pca$
#use that data to form our new inputs
dat2 <- predict(caret.pca, dat[,-5])


#using stats
stat.pca <- prcomp(dat[,-5],
                 center = TRUE,
                 scale. = TRUE) 

# plot method
plot(stat.pca, type = "l")
summary(stat.pca)
Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.7084 0.9560 0.38309 0.14393
Proportion of Variance 0.7296 0.2285 0.03669 0.00518
Cumulative Proportion  0.7296 0.9581 0.99482 1.00000

Below is a graphical representation5

At the end of the previous code cell, we measure the amount of the original variance (or spread) in the original data that is captured by each new dimension. As this example shows, these two new dimensions capture almost 96% of the variance in the original data. This means that any analysis that uses only these two new dimensions will closely represent the analysis if performed on the entire data.

Clustering

The last machine learning technique we will explore in this notebook is cluster finding. In this introductory notebook, we will demonstrate one of the simplest clustering techniques, spatial clustering, which seeks to first find NN clusters in a data set and to subsequently identify to which cluster each instance (or data point) belongs. The specific algorithm we employ below is the k-means algorithm, which is one of the simplest to understand. In this algorithm, we start with a guess for the number of clusters (again this can be based on prior information or iteratively quantified). We randomly place cluster centers in the data and determine how well the data cluster to these cluster centers. This information is used to pick new cluster centers, and the process continues until a solution converges (or we reach a predefined number of iterations).

Clusters<-kmeans(trainTransformed[,-5],centers=3)

Clusters
K-means clustering with 3 clusters of sizes 35, 25, 30

Cluster means:
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1   0.06556411  -0.8164251    0.4313175   0.3596444
2   1.17828077   0.2357890    0.9811063   1.0044368
3  -1.05839210   0.7560051   -1.3207923  -1.2566157

Clustering vector:
  1   2   3   4   5   7   8   9  11  13  14  15  17  21  22  26  28 
  3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
 30  31  33  34  35  36  38  39  41  42  44  46  50  51  52  53  54 
  3   3   3   3   3   3   3   3   3   3   3   3   3   2   2   2   1 
 55  57  59  62  64  67  68  69  73  74  75  76  78  79  80  82  83 
  1   2   1   1   1   1   1   1   1   1   1   2   2   1   1   1   1 
 84  85  87  88  89  90  94  95 100 103 104 105 107 108 110 114 115 
  1   1   2   1   1   1   1   1   1   2   2   2   1   2   2   1   1 
116 119 120 122 124 130 132 133 134 135 136 137 138 139 141 143 144 
  2   2   1   1   1   2   2   2   1   1   2   2   2   1   2   1   2 
145 146 147 149 150 
  2   2   1   2   1 

Within cluster sum of squares by cluster:
[1] 27.67946 25.20647 32.75160
 (between_SS / total_SS =  75.9 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"    
[5] "tot.withinss" "betweenss"    "size"         "iter"        
[9] "ifault"      

The above list is an output of the kmeans() function. Let’s see some of the important ones closely:

library(tidyverse)

Clusterdata<-trainTransformed
Clusterdata$Cluster<-as.factor(Clusters$cluster)

#view the whole dataset
knitr::kable(Clusterdata)%>%
  kableExtra::kable_styling("striped")%>%
  kableExtra::scroll_box(width = "100%",height="300px")
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Cluster
1 -0.8737422 1.1207908 -1.3320651 -1.2908481 setosa 3
2 -1.1145899 -0.0687277 -1.3320651 -1.2908481 setosa 3
3 -1.3554377 0.4070797 -1.3884289 -1.2908481 setosa 3
4 -1.4758616 0.1691760 -1.2757013 -1.2908481 setosa 3
5 -0.9941660 1.3586945 -1.3320651 -1.2908481 setosa 3
7 -1.4758616 0.8828871 -1.3320651 -1.1624765 setosa 3
8 -0.9941660 0.8828871 -1.2757013 -1.2908481 setosa 3
9 -1.7167093 -0.3066314 -1.3320651 -1.2908481 setosa 3
11 -0.5124705 1.5965982 -1.2757013 -1.2908481 setosa 3
13 -1.2350138 -0.0687277 -1.3320651 -1.4192198 setosa 3
14 -1.8371332 -0.0687277 -1.5011566 -1.4192198 setosa 3
15 -0.0307750 2.3103093 -1.4447927 -1.2908481 setosa 3
17 -0.5124705 2.0724056 -1.3884289 -1.0341049 setosa 3
21 -0.5124705 0.8828871 -1.1629736 -1.2908481 setosa 3
22 -0.8737422 1.5965982 -1.2757013 -1.0341049 setosa 3
26 -0.9941660 -0.0687277 -1.2193374 -1.2908481 setosa 3
28 -0.7533183 1.1207908 -1.2757013 -1.2908481 setosa 3
30 -1.3554377 0.4070797 -1.2193374 -1.2908481 setosa 3
31 -1.2350138 0.1691760 -1.2193374 -1.2908481 setosa 3
33 -0.7533183 2.5482130 -1.2757013 -1.4192198 setosa 3
34 -0.3920466 2.7861168 -1.3320651 -1.2908481 setosa 3
35 -1.1145899 0.1691760 -1.2757013 -1.2908481 setosa 3
36 -0.9941660 0.4070797 -1.4447927 -1.2908481 setosa 3
38 -1.1145899 1.3586945 -1.3320651 -1.4192198 setosa 3
39 -1.7167093 -0.0687277 -1.3884289 -1.2908481 setosa 3
41 -0.9941660 1.1207908 -1.3884289 -1.1624765 setosa 3
42 -1.5962854 -1.7340537 -1.3884289 -1.1624765 setosa 3
44 -0.9941660 1.1207908 -1.2193374 -0.7773616 setosa 3
46 -1.2350138 -0.0687277 -1.3320651 -1.1624765 setosa 3
50 -0.9941660 0.6449834 -1.3320651 -1.2908481 setosa 3
51 1.4143116 0.4070797 0.5279412 0.2496115 versicolor 2
52 0.6917683 0.4070797 0.4152135 0.3779832 versicolor 2
53 1.2938877 0.1691760 0.6406688 0.3779832 versicolor 2
54 -0.3920466 -1.7340537 0.1333944 0.1212399 versicolor 1
55 0.8121922 -0.5445352 0.4715773 0.3779832 versicolor 1
57 0.5713444 0.6449834 0.5279412 0.5063548 versicolor 2
59 0.9326161 -0.3066314 0.4715773 0.1212399 versicolor 1
62 0.0896489 -0.0687277 0.2461220 0.3779832 versicolor 1
64 0.3304966 -0.3066314 0.5279412 0.2496115 versicolor 1
67 -0.2716228 -0.0687277 0.4152135 0.3779832 versicolor 1
68 -0.0307750 -0.7824389 0.1897582 -0.2638750 versicolor 1
69 0.4509205 -1.9719574 0.4152135 0.3779832 versicolor 1
73 0.5713444 -1.2582463 0.6406688 0.3779832 versicolor 1
74 0.3304966 -0.5445352 0.5279412 -0.0071318 versicolor 1
75 0.6917683 -0.3066314 0.3024859 0.1212399 versicolor 1
76 0.9326161 -0.0687277 0.3588497 0.2496115 versicolor 2
78 1.0530399 -0.0687277 0.6970326 0.6347264 versicolor 2
79 0.2100728 -0.3066314 0.4152135 0.3779832 versicolor 1
80 -0.1511989 -1.0203426 -0.1484247 -0.2638750 versicolor 1
82 -0.3920466 -1.4961500 -0.0356971 -0.2638750 versicolor 1
83 -0.0307750 -0.7824389 0.0770306 -0.0071318 versicolor 1
84 0.2100728 -0.7824389 0.7533965 0.5063548 versicolor 1
85 -0.5124705 -0.0687277 0.4152135 0.3779832 versicolor 1
87 1.0530399 0.1691760 0.5279412 0.3779832 versicolor 2
88 0.5713444 -1.7340537 0.3588497 0.1212399 versicolor 1
89 -0.2716228 -0.0687277 0.1897582 0.1212399 versicolor 1
90 -0.3920466 -1.2582463 0.1333944 0.1212399 versicolor 1
94 -0.9941660 -1.7340537 -0.2611524 -0.2638750 versicolor 1
95 -0.2716228 -0.7824389 0.2461220 0.1212399 versicolor 1
100 -0.1511989 -0.5445352 0.1897582 0.1212399 versicolor 1
103 1.5347355 -0.0687277 1.2043071 1.1482130 virginica 2
104 0.5713444 -0.3066314 1.0352156 0.7630981 virginica 2
105 0.8121922 -0.0687277 1.1479433 1.2765846 virginica 2
107 -1.1145899 -1.2582463 0.4152135 0.6347264 virginica 1
108 1.7755832 -0.3066314 1.4297624 0.7630981 virginica 2
110 1.6551593 1.3586945 1.3170347 1.6616995 virginica 2
114 -0.1511989 -1.2582463 0.6970326 1.0198413 virginica 1
115 -0.0307750 -0.5445352 0.7533965 1.5333279 virginica 1
116 0.6917683 0.4070797 0.8661241 1.4049563 virginica 2
119 2.2572787 -1.0203426 1.7679453 1.4049563 virginica 2
120 0.2100728 -1.9719574 0.6970326 0.3779832 virginica 1
122 -0.2716228 -0.5445352 0.6406688 1.0198413 virginica 1
124 0.5713444 -0.7824389 0.6406688 0.7630981 virginica 1
130 1.6551593 -0.0687277 1.1479433 0.5063548 virginica 2
132 2.4981265 1.8345019 1.4861262 1.0198413 virginica 2
133 0.6917683 -0.5445352 1.0352156 1.2765846 virginica 2
134 0.5713444 -0.5445352 0.7533965 0.3779832 virginica 1
135 0.3304966 -1.0203426 1.0352156 0.2496115 virginica 1
136 2.2572787 -0.0687277 1.3170347 1.4049563 virginica 2
137 0.5713444 0.8828871 1.0352156 1.5333279 virginica 2
138 0.6917683 0.1691760 0.9788518 0.7630981 virginica 2
139 0.2100728 -0.0687277 0.5843050 0.7630981 virginica 1
141 1.0530399 0.1691760 1.0352156 1.5333279 virginica 2
143 -0.0307750 -0.7824389 0.7533965 0.8914697 virginica 1
144 1.1734638 0.4070797 1.2043071 1.4049563 virginica 2
145 1.0530399 0.6449834 1.0915794 1.6616995 virginica 2
146 1.0530399 -0.0687277 0.8097603 1.4049563 virginica 2
147 0.5713444 -1.2582463 0.6970326 0.8914697 virginica 1
149 0.4509205 0.8828871 0.9224880 1.4049563 virginica 2
150 0.0896489 -0.0687277 0.7533965 0.7630981 virginica 1
#Remember me
ggplot(data=Clusterdata,mapping = aes(x=Sepal.Width,y=Petal.Width,color=Cluster))+geom_point(alpha=0.5)
ggplot(data=Clusterdata,mapping = aes(x=Sepal.Width,y=Petal.Width,color=Cluster))+geom_point(alpha=0.5)+facet_wrap(~Species)
ggplot(data=Clusterdata,mapping = aes(x=Sepal.Width,y=Petal.Width,color=Species))+geom_point(alpha=0.5) + 
   geom_point(data=as.data.frame(Clusters$centers), aes(color="Cluster center"), size=5) + theme(legend.title = element_blank())+ggtitle("Iris Cluster Demonstration")

Exercise 1

Using the code above, answer the following questions.

  1. Change the p=.6 to p=.75 in the \[Data Pre-Processing\] section. How did the classification results change?

  2. Change the p=.6 to p=.4 in the \[Data Pre-Processing\] section. How did the classification results change?

  3. Change the k hyper-parameter in the k-nn estimator to three (and ten). In other words change the 5 to 10 in tuneGrid=data.frame(k=5) in the \[Classification\] section. How did the classification results change?

  4. Change the p=.6 to p=.75 in the \[Data Pre-Processing\] section. How did the regression results change?

  5. Change the pcaComp hyper-parameter in the PCA code example to three (and four) in the \[Dimensionality Reduction\] section. What are the new explained variances?

  6. Change the centers hyper-parameter in the cluster finding code example to two (and four) in the \[Clustering\] section. Where are the new cluster centers? Does this look better or worse?

  7. What does the set.seed function in R do? Why use it? Should we have used it above?

Model Persistence


As the previous code cells demonstrate, we can generate machine learning models rather easily for small data sets by using the caret library. For larger data sets, however, either in the number of instances, the number of features, or both, building a quality machine learning model can take considerable time and effort. As a result, we may wish to persist a trained machine learning model so that it can be applied to new data at a later time.

#save our model
save(DT_fit1,file = "DT_fit1.Rda")

#remove it from  the environment
rm(DT_fit1)
#load our model
load(file = "DT_fit1.Rda")

Whenever you load it back you can use it just like before.

Deep Dive

Regression

Regression is awesome

K-NN

K-NN is simple, but awesome

Regression

Introduction to Linear Regression

This lesson builds on the ordinary linear regression concept, introduced in business statistics, to discuss linear regression as a machine learning task. Regression, or the creation of a predictive model from data, is one of the key machine learning tasks. By using linear regression, which can often be solved analytically, we will introduce the concept of minimizing a cost function to determine the optimal model parameters in a straight-forward manner.

Objectives

By the end of this lesson, you will be able to

You were introduced to the concept of linear regression by learning about simple linear regression. This initial approach treated linear regression as a statistical technique where the relation between independent variables (or features) and a dependent variable (or target) was determined by a mathematical relation. While powerful, the previous approach treated linear regression as a distinct, statistical approach to relating independent variables with a dependent variable. In this lesson, we instead treat linear regression as a machine learning task. As a result, we will use regression to fit a model to data. The model generated in this fashion can be explored in greater detail to either understand why the provided data follow the generated model (i.e., gain insight into the data), or the model can be used to generate new dependent values from future or unseen data (i.e., make predictions from the model).

We will use the tips data set. After loading this data, we display several rows, and next compute a simple linear regression to predict the tip feature from the total_bill feature.

For this section we need to load libraries:

#load data. 
#curl package lets us download data from a website with the proper location
#check the packages tab and see if you have curl
#try following
  #?curl

library(curl)

load(curl("https://raw.githubusercontent.com/Professor-Hunt/ACC8143/main/data/tips.rda"))
head(tips,5)
# A tibble: 5 x 7
  total_bill   tip sex    smoker day   time    size
       <dbl> <dbl> <chr>  <chr>  <chr> <chr>  <dbl>
1       17.0  1.01 Female No     Sun   Dinner     2
2       10.3  1.66 Male   No     Sun   Dinner     3
3       21.0  3.5  Male   No     Sun   Dinner     3
4       23.7  3.31 Male   No     Sun   Dinner     2
5       24.6  3.61 Female No     Sun   Dinner     4
#view the whole dataset
knitr::kable(tips)%>%
  kableExtra::kable_styling("striped")%>%
  kableExtra::scroll_box(width = "100%",height="300px")
total_bill tip sex smoker day time size
16.99 1.01 Female No Sun Dinner 2
10.34 1.66 Male No Sun Dinner 3
21.01 3.50 Male No Sun Dinner 3
23.68 3.31 Male No Sun Dinner 2
24.59 3.61 Female No Sun Dinner 4
25.29 4.71 Male No Sun Dinner 4
8.77 2.00 Male No Sun Dinner 2
26.88 3.12 Male No Sun Dinner 4
15.04 1.96 Male No Sun Dinner 2
14.78 3.23 Male No Sun Dinner 2
10.27 1.71 Male No Sun Dinner 2
35.26 5.00 Female No Sun Dinner 4
15.42 1.57 Male No Sun Dinner 2
18.43 3.00 Male No Sun Dinner 4
14.83 3.02 Female No Sun Dinner 2
21.58 3.92 Male No Sun Dinner 2
10.33 1.67 Female No Sun Dinner 3
16.29 3.71 Male No Sun Dinner 3
16.97 3.50 Female No Sun Dinner 3
20.65 3.35 Male No Sat Dinner 3
17.92 4.08 Male No Sat Dinner 2
20.29 2.75 Female No Sat Dinner 2
15.77 2.23 Female No Sat Dinner 2
39.42 7.58 Male No Sat Dinner 4
19.82 3.18 Male No Sat Dinner 2
17.81 2.34 Male No Sat Dinner 4
13.37 2.00 Male No Sat Dinner 2
12.69 2.00 Male No Sat Dinner 2
21.70 4.30 Male No Sat Dinner 2
19.65 3.00 Female No Sat Dinner 2
9.55 1.45 Male No Sat Dinner 2
18.35 2.50 Male No Sat Dinner 4
15.06 3.00 Female No Sat Dinner 2
20.69 2.45 Female No Sat Dinner 4
17.78 3.27 Male No Sat Dinner 2
24.06 3.60 Male No Sat Dinner 3
16.31 2.00 Male No Sat Dinner 3
16.93 3.07 Female No Sat Dinner 3
18.69 2.31 Male No Sat Dinner 3
31.27 5.00 Male No Sat Dinner 3
16.04 2.24 Male No Sat Dinner 3
17.46 2.54 Male No Sun Dinner 2
13.94 3.06 Male No Sun Dinner 2
9.68 1.32 Male No Sun Dinner 2
30.40 5.60 Male No Sun Dinner 4
18.29 3.00 Male No Sun Dinner 2
22.23 5.00 Male No Sun Dinner 2
32.40 6.00 Male No Sun Dinner 4
28.55 2.05 Male No Sun Dinner 3
18.04 3.00 Male No Sun Dinner 2
12.54 2.50 Male No Sun Dinner 2
10.29 2.60 Female No Sun Dinner 2
34.81 5.20 Female No Sun Dinner 4
9.94 1.56 Male No Sun Dinner 2
25.56 4.34 Male No Sun Dinner 4
19.49 3.51 Male No Sun Dinner 2
38.01 3.00 Male Yes Sat Dinner 4
26.41 1.50 Female No Sat Dinner 2
11.24 1.76 Male Yes Sat Dinner 2
48.27 6.73 Male No Sat Dinner 4
20.29 3.21 Male Yes Sat Dinner 2
13.81 2.00 Male Yes Sat Dinner 2
11.02 1.98 Male Yes Sat Dinner 2
18.29 3.76 Male Yes Sat Dinner 4
17.59 2.64 Male No Sat Dinner 3
20.08 3.15 Male No Sat Dinner 3
16.45 2.47 Female No Sat Dinner 2
3.07 1.00 Female Yes Sat Dinner 1
20.23 2.01 Male No Sat Dinner 2
15.01 2.09 Male Yes Sat Dinner 2
12.02 1.97 Male No Sat Dinner 2
17.07 3.00 Female No Sat Dinner 3
26.86 3.14 Female Yes Sat Dinner 2
25.28 5.00 Female Yes Sat Dinner 2
14.73 2.20 Female No Sat Dinner 2
10.51 1.25 Male No Sat Dinner 2
17.92 3.08 Male Yes Sat Dinner 2
27.20 4.00 Male No Thur Lunch 4
22.76 3.00 Male No Thur Lunch 2
17.29 2.71 Male No Thur Lunch 2
19.44 3.00 Male Yes Thur Lunch 2
16.66 3.40 Male No Thur Lunch 2
10.07 1.83 Female No Thur Lunch 1
32.68 5.00 Male Yes Thur Lunch 2
15.98 2.03 Male No Thur Lunch 2
34.83 5.17 Female No Thur Lunch 4
13.03 2.00 Male No Thur Lunch 2
18.28 4.00 Male No Thur Lunch 2
24.71 5.85 Male No Thur Lunch 2
21.16 3.00 Male No Thur Lunch 2
28.97 3.00 Male Yes Fri Dinner 2
22.49 3.50 Male No Fri Dinner 2
5.75 1.00 Female Yes Fri Dinner 2
16.32 4.30 Female Yes Fri Dinner 2
22.75 3.25 Female No Fri Dinner 2
40.17 4.73 Male Yes Fri Dinner 4
27.28 4.00 Male Yes Fri Dinner 2
12.03 1.50 Male Yes Fri Dinner 2
21.01 3.00 Male Yes Fri Dinner 2
12.46 1.50 Male No Fri Dinner 2
11.35 2.50 Female Yes Fri Dinner 2
15.38 3.00 Female Yes Fri Dinner 2
44.30 2.50 Female Yes Sat Dinner 3
22.42 3.48 Female Yes Sat Dinner 2
20.92 4.08 Female No Sat Dinner 2
15.36 1.64 Male Yes Sat Dinner 2
20.49 4.06 Male Yes Sat Dinner 2
25.21 4.29 Male Yes Sat Dinner 2
18.24 3.76 Male No Sat Dinner 2
14.31 4.00 Female Yes Sat Dinner 2
14.00 3.00 Male No Sat Dinner 2
7.25 1.00 Female No Sat Dinner 1
38.07 4.00 Male No Sun Dinner 3
23.95 2.55 Male No Sun Dinner 2
25.71 4.00 Female No Sun Dinner 3
17.31 3.50 Female No Sun Dinner 2
29.93 5.07 Male No Sun Dinner 4
10.65 1.50 Female No Thur Lunch 2
12.43 1.80 Female No Thur Lunch 2
24.08 2.92 Female No Thur Lunch 4
11.69 2.31 Male No Thur Lunch 2
13.42 1.68 Female No Thur Lunch 2
14.26 2.50 Male No Thur Lunch 2
15.95 2.00 Male No Thur Lunch 2
12.48 2.52 Female No Thur Lunch 2
29.80 4.20 Female No Thur Lunch 6
8.52 1.48 Male No Thur Lunch 2
14.52 2.00 Female No Thur Lunch 2
11.38 2.00 Female No Thur Lunch 2
22.82 2.18 Male No Thur Lunch 3
19.08 1.50 Male No Thur Lunch 2
20.27 2.83 Female No Thur Lunch 2
11.17 1.50 Female No Thur Lunch 2
12.26 2.00 Female No Thur Lunch 2
18.26 3.25 Female No Thur Lunch 2
8.51 1.25 Female No Thur Lunch 2
10.33 2.00 Female No Thur Lunch 2
14.15 2.00 Female No Thur Lunch 2
16.00 2.00 Male Yes Thur Lunch 2
13.16 2.75 Female No Thur Lunch 2
17.47 3.50 Female No Thur Lunch 2
34.30 6.70 Male No Thur Lunch 6
41.19 5.00 Male No Thur Lunch 5
27.05 5.00 Female No Thur Lunch 6
16.43 2.30 Female No Thur Lunch 2
8.35 1.50 Female No Thur Lunch 2
18.64 1.36 Female No Thur Lunch 3
11.87 1.63 Female No Thur Lunch 2
9.78 1.73 Male No Thur Lunch 2
7.51 2.00 Male No Thur Lunch 2
14.07 2.50 Male No Sun Dinner 2
13.13 2.00 Male No Sun Dinner 2
17.26 2.74 Male No Sun Dinner 3
24.55 2.00 Male No Sun Dinner 4
19.77 2.00 Male No Sun Dinner 4
29.85 5.14 Female No Sun Dinner 5
48.17 5.00 Male No Sun Dinner 6
25.00 3.75 Female No Sun Dinner 4
13.39 2.61 Female No Sun Dinner 2
16.49 2.00 Male No Sun Dinner 4
21.50 3.50 Male No Sun Dinner 4
12.66 2.50 Male No Sun Dinner 2
16.21 2.00 Female No Sun Dinner 3
13.81 2.00 Male No Sun Dinner 2
17.51 3.00 Female Yes Sun Dinner 2
24.52 3.48 Male No Sun Dinner 3
20.76 2.24 Male No Sun Dinner 2
31.71 4.50 Male No Sun Dinner 4
10.59 1.61 Female Yes Sat Dinner 2
10.63 2.00 Female Yes Sat Dinner 2
50.81 10.00 Male Yes Sat Dinner 3
15.81 3.16 Male Yes Sat Dinner 2
7.25 5.15 Male Yes Sun Dinner 2
31.85 3.18 Male Yes Sun Dinner 2
16.82 4.00 Male Yes Sun Dinner 2
32.90 3.11 Male Yes Sun Dinner 2
17.89 2.00 Male Yes Sun Dinner 2
14.48 2.00 Male Yes Sun Dinner 2
9.60 4.00 Female Yes Sun Dinner 2
34.63 3.55 Male Yes Sun Dinner 2
34.65 3.68 Male Yes Sun Dinner 4
23.33 5.65 Male Yes Sun Dinner 2
45.35 3.50 Male Yes Sun Dinner 3
23.17 6.50 Male Yes Sun Dinner 4
40.55 3.00 Male Yes Sun Dinner 2
20.69 5.00 Male No Sun Dinner 5
20.90 3.50 Female Yes Sun Dinner 3
30.46 2.00 Male Yes Sun Dinner 5
18.15 3.50 Female Yes Sun Dinner 3
23.10 4.00 Male Yes Sun Dinner 3
15.69 1.50 Male Yes Sun Dinner 2
19.81 4.19 Female Yes Thur Lunch 2
28.44 2.56 Male Yes Thur Lunch 2
15.48 2.02 Male Yes Thur Lunch 2
16.58 4.00 Male Yes Thur Lunch 2
7.56 1.44 Male No Thur Lunch 2
10.34 2.00 Male Yes Thur Lunch 2
43.11 5.00 Female Yes Thur Lunch 4
13.00 2.00 Female Yes Thur Lunch 2
13.51 2.00 Male Yes Thur Lunch 2
18.71 4.00 Male Yes Thur Lunch 3
12.74 2.01 Female Yes Thur Lunch 2
13.00 2.00 Female Yes Thur Lunch 2
16.40 2.50 Female Yes Thur Lunch 2
20.53 4.00 Male Yes Thur Lunch 4
16.47 3.23 Female Yes Thur Lunch 3
26.59 3.41 Male Yes Sat Dinner 3
38.73 3.00 Male Yes Sat Dinner 4
24.27 2.03 Male Yes Sat Dinner 2
12.76 2.23 Female Yes Sat Dinner 2
30.06 2.00 Male Yes Sat Dinner 3
25.89 5.16 Male Yes Sat Dinner 4
48.33 9.00 Male No Sat Dinner 4
13.27 2.50 Female Yes Sat Dinner 2
28.17 6.50 Female Yes Sat Dinner 3
12.90 1.10 Female Yes Sat Dinner 2
28.15 3.00 Male Yes Sat Dinner 5
11.59 1.50 Male Yes Sat Dinner 2
7.74 1.44 Male Yes Sat Dinner 2
30.14 3.09 Female Yes Sat Dinner 4
12.16 2.20 Male Yes Fri Lunch 2
13.42 3.48 Female Yes Fri Lunch 2
8.58 1.92 Male Yes Fri Lunch 1
15.98 3.00 Female No Fri Lunch 3
13.42 1.58 Male Yes Fri Lunch 2
16.27 2.50 Female Yes Fri Lunch 2
10.09 2.00 Female Yes Fri Lunch 2
20.45 3.00 Male No Sat Dinner 4
13.28 2.72 Male No Sat Dinner 2
22.12 2.88 Female Yes Sat Dinner 2
24.01 2.00 Male Yes Sat Dinner 4
15.69 3.00 Male Yes Sat Dinner 3
11.61 3.39 Male No Sat Dinner 2
10.77 1.47 Male No Sat Dinner 2
15.53 3.00 Male Yes Sat Dinner 2
10.07 1.25 Male No Sat Dinner 2
12.60 1.00 Male Yes Sat Dinner 2
32.83 1.17 Male Yes Sat Dinner 2
35.83 4.67 Female No Sat Dinner 3
29.03 5.92 Male No Sat Dinner 3
27.18 2.00 Female Yes Sat Dinner 2
22.67 2.00 Male Yes Sat Dinner 2
17.82 1.75 Male No Sat Dinner 2
18.78 3.00 Female No Thur Dinner 2

Perform simple linear regression

OLS1<-lm(formula=tip~total_bill,data=tips)
#general output
OLS1

Call:
lm(formula = tip ~ total_bill, data = tips)

Coefficients:
(Intercept)   total_bill  
     0.9203       0.1050  
#more common output
summary(OLS1)

Call:
lm(formula = tip ~ total_bill, data = tips)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1982 -0.5652 -0.0974  0.4863  3.7434 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.920270   0.159735   5.761 2.53e-08 ***
total_bill  0.105025   0.007365  14.260  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.022 on 242 degrees of freedom
Multiple R-squared:  0.4566,    Adjusted R-squared:  0.4544 
F-statistic: 203.4 on 1 and 242 DF,  p-value: < 2.2e-16
#correlation or R-squared...
cor(tips$total_bill,tips$tip)^2
[1] 0.4566166

Formalism

Formally, this simple linear model related the independent variables \(x_i\) to the dependent variables \(y_i\) in our data set via two parameters: an intercept, and a slope. Mathematically, we express this relation in the following form:

\[ f(x_i) = \beta * x_i + \alpha + \epsilon_i \]

where \(\epsilon_i\) accounts for the difference between the model and the data for each data point \((x_i,y_i)\). If we have a perfect model, these errors, \(\epsilon_i\), are all zero, and \(y_i = f(x_i)\). In real life, however, the error terms rarely vanish because even if the original relationship is perfect noise creeps into the measurement process.

As a result, in this simple example we wish to determine the model parameters: \(\beta_i\), and \(\alpha_i\) that minimize the values of \(\epsilon_i\). We could perform this process in an iterative manner, trying different values for the model parameters and measuring the error function. This approach is often used in machine learning, where we define a cost function that we seek to minimize by selecting the best model parameters.

In the case of a simple linear model, we have several potential cost (or loss) functions that we could seek to minimize, but we will use the common l2-norm: \(\epsilon_i^2 = \left( \ y_i - f(x_i) \ \right)^2\), where \(f(x_i)\) is defined by our model parameters. We demonstrate this approach visually in the following code block, where we minimize the sum of the l2-norm model residuals, which is done by finding the best model parameters: \(\hat{\beta}\), and \(\hat{\alpha}\).

#Get some data
AnsDat<-anscombe%>%
  select(y1,x1)

#extract x and y columns
Y<-AnsDat$y1
X<-AnsDat$x1

#find the number of data points
n<-nrow(AnsDat)

#determine mean values
mean_x<-mean(X,na.rm = TRUE)
mean_y<-mean(Y,na.rm = TRUE)

#determine best fit model parameters (from simple linear regression)
beta = sum((X - mean_x) * (Y - mean_y)) / sum((X - mean_x)**2)
beta
[1] 0.5000909
alpha = mean_y - beta * mean_x
alpha
[1] 3.000091
#lets double check
summary(lm(formula=Y~X,data=AnsDat))

Call:
lm(formula = Y ~ X, data = AnsDat)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.92127 -0.45577 -0.04136  0.70941  1.83882 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   3.0001     1.1247   2.667  0.02573 * 
X             0.5001     0.1179   4.241  0.00217 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared:  0.6665,    Adjusted R-squared:  0.6295 
F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217

Plots

library(ggplot2)

#create regression plot
ggplot(AnsDat,aes(x1, y1)) +
  geom_point() +
  geom_smooth(method='lm', se=FALSE) +
  geom_segment(aes(x=X, xend=X, y=Y, yend=lm(Y~X)$fitted.values, color="error"))+
  theme_minimal() +
  labs(x='X Values', y='Y Values', title='Linear Regression Plot') +
  theme(plot.title = element_text(hjust=0.5, size=20, face='bold')) + 
  theme(legend.title = element_blank())

Cost Function

This simple example demonstrates a fundamental concept in machine learning, namely the minimization of a cost (or loss) function, which quantifies how well a model represents a data set. For a given data set, the cost function is completely specified by the model parameters, thus a more complex model has a more complex cost function, which can become difficult to minimize. To clarify this point, we now turn to the exploration of the shape of cost functions.

For simplicity, we start with a one-dimensional cost function, a linear model with no intercept: \(f(x_i) = \beta x_i\). In the following code cell, we compute the cost function for a given data set as a function of the unknown parameter \(\beta\). In this case, the minimum is easy to visualize, given the steepness of the cost function around the minimum.

#define our betas
betas<-seq(-4,4,length.out=100)

#define our cost function
l2n = sapply(as.matrix(betas), function(m) log(sqrt(sum((as.matrix(tips$tip) - m*as.matrix(tips$total_bill))^2))))  # The L2-norm
library(ggplot2)
costplot<-as.data.frame(cbind(betas,l2n))
#create regression plot
ggplot(costplot,aes(betas, l2n)) +
  geom_point(color="blue") + geom_line()+
  geom_vline(xintercept=0, color="red")

In general, however, we face two challenges:

  1. the cost function will likely be more complex, and

  2. our data will be higher dimensional.

In general, we must employ a (potentially) complex mathematical technique to find the (hopefully) global minimum of the cost function. We can increase the complexity of our cost function analysis by extending the original model to include both a slope and an intercept. We now must find the minimum of this two dimensional model, given our observed data. We do this in the following code cell where we generate a grid of values in our two parameters, and compute the cost function for these different parameter combinations.

To display the data which generates a sampling grid across potential values for the slope \(\beta\) and intercept \(\alpha\) in our model. We once again vectorize our cost function and broadcast it across the sampling grid. We accumulate the cost at each grid point and generate a two-dimensional image of the values of the cost function across our sampling grid. To make the image appear cleaner, we perform Gaussian interpolation between sample points.

As the following two-dimensional image displays, our cost function is not aligned with either parameter, but is steeper in the slope parameter and less steep in the intercept parameter. Thus, we would expect that small changes in the slope will quickly increase our cost (which we saw in the previous one-dimensional example), while small changes in the intercept will produce smaller changes in our cost function (note that the range for intercepts is much larger than the range for the slope parameters).

#define our betas
betas<-seq(-4,4,length.out=100)
alphas<-seq(-40,40,length.out=100)
## Generate a grid of X- and Y- values on which to predict
grid <-expand.grid(betas,alphas)
#define our cost function
l2n2 = mapply( function(m,b) log(sqrt(sum((as.matrix(tips$tip) - m*as.matrix(tips$total_bill) - b)^2))),as.matrix(grid$Var1),as.matrix(grid$Var2))  # The L2-norm
library(ggplot2)

ggplot(grid, aes(Var1, Var2)) +
  geom_raster(aes(fill=l2n2),show.legend = FALSE) +
  geom_point(color="deepskyblue3",aes(OLS1$coefficients[[2]],OLS1$coefficients[[1]]))+
  theme_minimal() +
  labs(x=expression(beta), y=expression(alpha), title=expression(paste("Cost function for"," ",y==beta*x+alpha))) +
  theme(plot.title = element_text(hjust=0.5, size=20, face='bold')) + 
  theme(legend.title = element_blank())

As we move to higher dimensional data sets or more complex cost functions, the challenge of finding the global minimum becomes increasingly difficult. As a result, many mathematical techniques have been developed to find the global minimum of a (potentially) complex function. The standard approach is gradient descent, where we use the fact that the first derivative (or gradient) measures the slope of a function at a given point. We can use the slope to infer which direction is downhill and thus travel (hopefully) towards the minimum.

A major challenge with this approach is the potential to become stuck in a local and not global minima. Thus, modifications are often added to reduce the likelihood of becoming stuck in a local minimum. One popular example of this approach is known as stochastic gradient descent. This algorithm employs standard gradient descent, but adds an occasional random jump in the parameter space to reduce the chances of being stuck in a local valley. Another, very different, approach to this problem is the use of genetic algorithms, which employ techniques from evolutionary biology to minimize the cost function.

For a mental picture of this process, imagine hiking in the mountains and flip the challenge to finding the highest peak, so we will use gradient ascent. Gradient ascent is similar to finding the local mountain peak and climbing it. This local peak might look like it is the largest, but a random jump away from the local peak might enable one to view much larger peaks beyond, which can subsequently be climbed with a new gradient ascent.

Whenever you perform machine learning in the future, you should keep in mind that the model that you generate for a given data set has generally resulted from the minimization of a cost function. Thus, there remains the possibility that with more effort, more data, or a better cost minimization strategy, a new, and better model may potentially exist.

#set the seed :)
set.seed(1)
#get our samples

#lets split the data 60/40
library(caret)
trainIndex <- createDataPartition(tips$tip, p = .6, list = FALSE, times = 1)

#look at the first few
#head(trainIndex)

#grab the data
tipsTrain <- tips[ trainIndex,]
tipsTest  <- tips[-trainIndex,]

Linear Regression

In the following code cells, we use the lm estimator to fit our sample data, plot the results, and finally display the fit coefficients.

The first code cell defines a function that will make two plots. The top plot is a comparison between a single independent variable (Total Bill) and the dependent variable (Tip). This plot differentiates the training data, the testing data, and the linear model. The bottom plot displays the model residuals (dependent variable - model result) as a function of the independent variable. The primary benefit of this plot is the ability to identify any structure in the residuals, which can indicate a bad model. For example, if the residual plot shows a linear relationship, that indicates the original model incorrectly related the independent and dependent variables.

In the following code cells, we first compute a linear fit with no intercept, after which we compute a linear fit with both a slope and an intercept. The fit results are displayed as well as the regression and residual plots.

The code below computes a regression with no intercept.

#fit simple linear regression model
model_noint <- lm(tip ~ 0+total_bill , data = tipsTrain)

noint_results<-predict(model_noint,tipsTest)
###compute fit
summary(model_noint)

Call:
lm(formula = tip ~ 0 + total_bill, data = tipsTrain)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.7910 -0.1871  0.2019  0.7518  4.1204 

Coefficients:
           Estimate Std. Error t value Pr(>|t|)    
total_bill 0.142010   0.004331   32.79   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.153 on 147 degrees of freedom
Multiple R-squared:  0.8797,    Adjusted R-squared:  0.8789 
F-statistic:  1075 on 1 and 147 DF,  p-value: < 2.2e-16
knitr::kable(caret::RMSE(noint_results,tipsTest$tip),col.names = "RMSE")
RMSE
0.9808281

Below is the regression plot for the model with no intercept.

tipsTest$Sample<-"Testing"
tipsTrain$Sample<-"Training"

Combined_Tips<-rbind(tipsTest,tipsTrain)
#create regression plot with customized style
ggplot(Combined_Tips,aes(x=total_bill, y=tip,color=Sample)) +
  geom_point(alpha=.5) +
  theme_minimal() +
  labs(x='X Values', y='Y Values', title='Linear Regression Plot') +
  theme(plot.title = element_text(hjust=0.5, size=20, face='bold')) +
  geom_abline(aes(slope=model_noint$coefficients[[1]],intercept=0),color="red")

Below is a residual (error) plot.

library(tidyverse)
#create residuals
testwithpred<-as.data.frame(cbind(noint_results,tipsTest))
#create residuals
testwithpred<-testwithpred%>%
  rename(prediction=noint_results)%>%
  mutate(error=tip-prediction)

#create regression plot with customized style
ggplot(testwithpred,aes(x=total_bill, y=error)) +
  geom_point(alpha=.5,color="deepskyblue") +
  theme_minimal() +
  labs(x='Total Bill', y='Error', title='Regression Error Plot') +
  theme(plot.title = element_text(hjust=0.25, size=20, face='bold')) +
  geom_hline(yintercept=0,color="red",linetype="dashed")

Link to some good examples of interpreting residual plots 🏫

Model with an intercept

#fit simple linear regression model
model_int <- lm(tip ~ total_bill , data = tipsTrain)

int_results<-predict(model_int,tipsTest)
###compute fit
summary(model_int)

Call:
lm(formula = tip ~ total_bill, data = tipsTrain)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1243 -0.5729 -0.0703  0.4668  3.4083 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.018249   0.209997   4.849 3.14e-06 ***
total_bill  0.099788   0.009596  10.399  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.074 on 146 degrees of freedom
Multiple R-squared:  0.4255,    Adjusted R-squared:  0.4216 
F-statistic: 108.1 on 1 and 146 DF,  p-value: < 2.2e-16
knitr::kable(caret::RMSE(int_results,tipsTest$tip),col.names = "RMSE")
RMSE
0.9408934

Below is a model with an intercept

#create regression plot with customized style
ggplot(Combined_Tips,aes(x=total_bill, y=tip,color=Sample)) +
  geom_point(alpha=.5) +
  theme_minimal() +
  labs(x='X Values', y='Y Values', title='Linear Regression Plot') +
  theme(plot.title = element_text(hjust=0.5, size=20, face='bold')) +
  geom_abline(aes(slope=model_int$coefficients[[2]],intercept=model_int$coefficients[[1]]),color="red")

Residual Plot

#create residuals
testwithpred2<-as.data.frame(cbind(int_results,tipsTest))
#create residuals
testwithpred2<-testwithpred2%>%
  rename(prediction=int_results)%>%
  mutate(error=tip-prediction)

#create regression plot with customized style
ggplot(testwithpred2,aes(x=total_bill, y=error)) +
  geom_point(alpha=.5,color="deepskyblue") +
  theme_minimal() +
  labs(x='Total Bill', y='Error', title='Regression Error Plot') +
  theme(plot.title = element_text(hjust=0.25, size=20, face='bold')) +
  geom_hline(yintercept=0,color="red",linetype="dashed")

Multivariate Regression

Often, using more data will result in more accurate models, since finer details can be captured. For example, if we see structure in a residual plot, the easiest solution is often to add additional independent variables to our model, which results in a multivariate linear regression model. The only major change required to our previous model building code is the expansion of our equation in the lm() function to include the additional independent variables.

To demonstrate building a multi-variate regression model, the following code uses both the total_bill and size features from the tips data set to use as independent variables. The tip feature is used as the dependent variable.

The following code generates a multi-variate linear model, displays the model parameters, and displays the regression and residual plots. To make the regression plot, we must use only one feature (in this case the total_bill). As a result, when we display the generated model, we get a series of lines that are the projections of the multi-variate model on this two-dimensional figure.

#fit simple linear regression model
model_multi <- lm(tip ~ total_bill+size , data = tipsTrain)

multi_results<-predict(model_multi,tipsTest)
###compute fit
summary(model_multi)

Call:
lm(formula = tip ~ total_bill + size, data = tipsTrain)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5272 -0.5926 -0.0678  0.5428  3.3375 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.46842    0.24272   1.930 0.055576 .  
total_bill   0.07291    0.01135   6.425 1.77e-09 ***
size         0.41764    0.10448   3.997 0.000102 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.023 on 145 degrees of freedom
Multiple R-squared:  0.4825,    Adjusted R-squared:  0.4754 
F-statistic:  67.6 on 2 and 145 DF,  p-value: < 2.2e-16
knitr::kable(caret::RMSE(multi_results,tipsTest$tip),col.names = "RMSE")
RMSE
1.039275
library(plotly)

scatter.plot<-plot_ly(Combined_Tips, x = ~total_bill, y = ~size, z = ~tip, color = ~Sample, colors = c('lightblue', 'violet'))%>% 
  add_markers(size=6)%>% 
  layout(scene = list(
          xaxis = list(title = 'Total Bill'),
          yaxis = list(title = 'Size'),
          zaxis = list(title = 'Tip')))

#scatter.plot

library(reshape2)
#Graph Resolution (more important for more complex shapes)
graph_reso <- 0.05

#Setup Axis
axis_x <- seq(min(tipsTest$total_bill), max(tipsTest$total_bill), by = graph_reso)
axis_y <- seq(min(tipsTest$size), max(tipsTest$size), by = graph_reso)

#Sample points
lm_surface <- expand.grid(total_bill = axis_x,size = axis_y,KEEP.OUT.ATTRS = F)
lm_surface$tips <- predict(model_multi, newdata = lm_surface)
lm_surface <- acast(lm_surface, size ~ total_bill, value.var = "tips")

scatter.plot<- add_trace(p = scatter.plot,
                       z = lm_surface,
                       x = axis_x,
                       y = axis_y,
                       type = "surface",colorscale = list(c(0, 1), c("wheat", "royalblue")))%>%
                layout(legend = list(x = -.1, y = 1), title="Regression Plot")

scatter.plot
#create residuals
testwithpred3<-as.data.frame(cbind(multi_results,tipsTest))
#create residuals
testwithpred3<-testwithpred3%>%
  rename(prediction=multi_results)%>%
  mutate(error=tip-prediction)


library(plotly)

scatter.plot<-plot_ly(testwithpred3, x = ~total_bill, y = ~size, z = ~error)%>% 
  add_markers(size=6)%>% 
  layout(scene = list(
          xaxis = list(title = 'Total Bill'),
          yaxis = list(title = 'Size'),
          zaxis = list(title = 'Error')))

#scatter.plot

library(reshape2)
#Graph Resolution (more important for more complex shapes)
graph_reso <- 0.05

#Setup Axis
axis_x <- seq(min(tipsTest$total_bill), max(tipsTest$total_bill), by = graph_reso)
axis_y <- seq(min(tipsTest$size), max(tipsTest$size), by = graph_reso)

#Sample points
lm_surface <- expand.grid(total_bill = axis_x,size = axis_y,KEEP.OUT.ATTRS = F)
lm_surface$error <-rep(0,nrow(lm_surface))
lm_surface <- acast(lm_surface, size ~ total_bill, value.var = "error")

scatter.plot<- add_trace(p = scatter.plot,
                       z = lm_surface,
                       x = axis_x,
                       y = axis_y,
                       type = "surface",
                       colorscale = list(c(0,1), c("red","red")))%>%
                       layout(showlegend=FALSE, title= "Error Plot")

hide_colorbar(scatter.plot)

Exercise 1

  1. Run a simple regression to predict total_bill with tip. What is the RSquared? What is the RMSE?
  2. Plot the regression and the residuals from number 1.
  3. Run a multivariate regression to predict total_bill with tip and size. What is the RSquared? What is the RMSE?
  4. Try to make a 3D plot…but do not spend more than 10-15 minutes on this one.

Categorical Variables

Many data sets contain features that are non-numerical. For example, the tips data set contains a day feature that can take one of four values: Thur, Fri, Sat, and Sun. This data set also contains a sex feature that can be Female or Male, and a smoker feature that can be No or Yes. Each of these features are categorical features, in that they can only take on one of a limited number of possible values. In general, the possible states are fixed, such as the sex, smoker, and day features discussed previously.

Categorical features can take several forms. For example, a categorical feature, such as sex or smoker that can take on one of two values is known as a binary feature. Furthermore, categorical features can also be categorized into nominal and ordinal features (note that other classes are also possible, but beyond the scope of this class).

A nominal feature either is in a category or it isn’t, and there are no relations between the different categories. For example, the sex category is nominal since there is no numerical relation or ordering among the possible values. On the other hand, an ordinal feature is a categorical feature where the possible values have an intrinsic relationship. For example, if we encode the results of a race as first, second, and third, these values have a relationship, in that first comes before the other two, and the difference between first and second is the same as between second and third. In our tips example, we could treat the day features in this manner, since the days often are treated as having an ordinal relationship.

df = data.frame(Color = c("Red", "Blue", "Green", "Blue", "Blue", "Red"))

knitr::kable(df)
Color
Red
Blue
Green
Blue
Blue
Red

This encoding is fine if the data are ordinal, but in this case, our colors are likely nominal and there is no numerical relationship between the different features. Thus, we need to perform an additional transformation to convert our data into a numerical format that a machine learning model can effectively process. To do this, a commonly used approach known as One Hot Encoding is used. This approach generates a new feature for each possible value in our category. Thus, for our four colors, we need four features. These features will be binary, in that a value of zero indicates that the feature is not present for the specific instance, and a value of one indicates it is present. Furthermore, only one set of these new features can be present (or on) for a specific instance.

library(varhandle)
dumvars<-as.data.frame(to.dummy(df$Color,"dum"))

knitr::kable(dumvars)
dum.Blue dum.Green dum.Red
0 0 1
1 0 0
0 1 0
1 0 0
1 0 0
0 0 1

Linear Regression with Categorical Variables

We fit the model, display the fit coefficients, compute the model performance, and finally display the regression model plot and the residual model plot. In this case, our new model performs slightly worse than the original single variable linear regression model. This suggests that the day of the week is not an important variable in the underlying relationship between total_bill and tip. By evaluating other feature combinations, you may be able to find a better predicting model.

dumvars_Train<-as.data.frame(to.dummy(tipsTrain$day,"dum"))

dumvars_Train<-cbind(dumvars_Train,tipsTrain)

#fit simple linear regression model

model_dum1 <- lm(tip ~total_bill+ dum.Sat+dum.Sun+dum.Thur+dum.Fri , data = dumvars_Train)

dumvars_Test<-as.data.frame(to.dummy(tipsTest$day,"dum"))

dumvars_Test<-cbind(dumvars_Test,tipsTest)

dum1_results<-predict(model_dum1,dumvars_Test)
###compute fit
summary(model_dum1)

Call:
lm(formula = tip ~ total_bill + dum.Sat + dum.Sun + dum.Thur + 
    dum.Fri, data = dumvars_Train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0514 -0.5612 -0.0356  0.4624  3.2878 

Coefficients: (1 not defined because of singularities)
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.023612   0.333353   3.071  0.00256 ** 
total_bill   0.098554   0.010186   9.675  < 2e-16 ***
dum.Sat     -0.037771   0.330383  -0.114  0.90914    
dum.Sun      0.124046   0.343880   0.361  0.71884    
dum.Thur    -0.004739   0.333255  -0.014  0.98867    
dum.Fri            NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.083 on 143 degrees of freedom
Multiple R-squared:  0.4276,    Adjusted R-squared:  0.4116 
F-statistic: 26.71 on 4 and 143 DF,  p-value: < 2.2e-16
knitr::kable(caret::RMSE(dum1_results,tipsTest$tip),col.names = "RMSE")
RMSE
0.9394876
#fit simple linear regression model
model_dum2 <- lm(tip ~ total_bill+factor(day) , data = tipsTrain)

dum2_results<-predict(model_dum2,tipsTest)
###compute fit
summary(model_dum2)

Call:
lm(formula = tip ~ total_bill + factor(day), data = tipsTrain)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0514 -0.5612 -0.0356  0.4624  3.2878 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      1.023612   0.333353   3.071  0.00256 ** 
total_bill       0.098554   0.010186   9.675  < 2e-16 ***
factor(day)Sat  -0.037771   0.330383  -0.114  0.90914    
factor(day)Sun   0.124046   0.343880   0.361  0.71884    
factor(day)Thur -0.004739   0.333255  -0.014  0.98867    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.083 on 143 degrees of freedom
Multiple R-squared:  0.4276,    Adjusted R-squared:  0.4116 
F-statistic: 26.71 on 4 and 143 DF,  p-value: < 2.2e-16
knitr::kable(caret::RMSE(dum2_results,tipsTest$tip),col.names = "RMSE")
RMSE
0.9394876
stargazer::stargazer(model_dum2,title="Fancy Reg Table",
          type = "html",
          float = TRUE,
          report = "vcs*",
          no.space = TRUE,
          header=FALSE,
          single.row = TRUE,
          #font.size = "small",
          intercept.bottom = F)
Fancy Reg Table
Dependent variable:
tip
Constant 1.024 (0.333)***
total_bill 0.099 (0.010)***
factor(day)Sat -0.038 (0.330)
factor(day)Sun 0.124 (0.344)
factor(day)Thur -0.005 (0.333)
Observations 148
R2 0.428
Adjusted R2 0.412
Residual Std. Error 1.083 (df = 143)
F Statistic 26.705*** (df = 4; 143)
Note: p<0.1; p<0.05; p<0.01

Regression Plot

#create regression plot with customized style
ggplot(Combined_Tips,aes(x=total_bill, y=tip,color=Sample)) +
  geom_point(alpha=.5) +
  theme_minimal() +
  labs(x='total bill', y='tip', title='Linear Regression Plot') +
  theme(plot.title = element_text(hjust=0.5, size=20, face='bold')) +
  geom_abline(aes(slope=model_dum2$coefficients[[2]],
                  intercept=model_int$coefficients[[1]]),color="red")+
    geom_abline(aes(slope=model_dum2$coefficients[[2]],
                  intercept=model_int$coefficients[[1]]+model_dum2$coefficients[[3]]),color="red")+
  geom_abline(aes(slope=model_dum2$coefficients[[2]],
                  intercept=model_int$coefficients[[1]]+model_dum2$coefficients[[4]]),color="red")+
  geom_abline(aes(slope=model_dum2$coefficients[[2]],
                  intercept=model_int$coefficients[[1]]+model_dum2$coefficients[[5]]),color="red")

Residual Plot

#create residuals
testwithpred4<-as.data.frame(cbind(dum2_results,tipsTest))
#create residuals
testwithpred4<-testwithpred4%>%
  rename(prediction=dum2_results)%>%
  mutate(error=tip-prediction)

#create regression plot with customized style
ggplot(testwithpred4,aes(x=total_bill, y=error)) +
  geom_point(alpha=.5,color="deepskyblue") +
  theme_minimal() +
  labs(x='Total Bill', y='Error', title='Regression Error Plot') +
  theme(plot.title = element_text(hjust=0.25, size=20, face='bold')) +
  geom_hline(yintercept=0,color="red",linetype="dashed")

Interpreting the coefficients?

Exercise 2

  1. We used multi-variate linear regression to predict the tip feature from the total_bill and categorical day features. Repeat this process, but use the total_bill, size, sex, and time features. Has the prediction performance improved, i.e., what is the RSquared and RMSE?
  2. Interpret 2 of the coefficients.

  1. The is borrowed content from https://insights.principa.co.za/4-types-of-data-analytics-descriptive-diagnostic-predictive-prescriptive↩︎

  2. This is a narrow definition because you can also impute with summary statistics such as mean or median.↩︎

  3. Content for this caret portion is borrowed from https://www.rebeccabarter.com/blog/2017-11-17-caret_tutorial/↩︎

  4. https://data.library.virginia.edu/is-r-squared-useless/↩︎

  5. https://www.r-bloggers.com/2013/11/computing-and-visualizing-pca-in-r/↩︎